Abstract

Much information is stored in the amino acid composition of proteins and the base composition of DNA.
We simulated the evolution of amino acid frequencies and genomic GC content with a linguistic model.
The results show that the evolution of the genetic code determines the evolution of amino acid
frequencies and genomic GC content. We explain the relationships among amino acid frequencies, genomic
GC content and protein length distribution in a unified theoretical framework. In particular, the
simulations of the evolution of amino acid frequencies and of codon-position GC content agree closely
with the results based on the data of all genomes known so far. Furthermore, we find that the space
spanned by the average protein length in a proteome and the ratio of amino acid frequencies is useful
for describing phylogeny and evolution: the points of all species in this space form an evolutionary
flow. We believe that amino acid gain and loss is driven by the established pattern of variation in
amino acid frequencies. The linguistic mechanism helps to unveil the origin of the
genetic code.

Amino acid frequencies vary slightly among species, while genomic GC content varies greatly, and
proteins from organisms with GC- (or AT-) rich genomes contain more (or fewer) amino acids encoded
by GC-rich codons [1] [2]. However, the mechanism behind the variation of amino acid frequencies and
genomic GC content has been a long-standing and far-reaching problem. We did not realize the
significance of this problem until we explained the relationships among amino acid frequencies, genomic
GC content and protein length distribution in a unified theoretical framework. Its significance is
similar to that of the cosmic microwave background in physics, a relic of the early evolution of the
universe. The evolution of amino acid frequencies and genomic GC content decouples
from the evolution of DNA and protein sequences. Therefore, rich information about the early evolution
of life can be stored in the amino acid frequencies and genomic GC content of extant
organisms, which are closely related to the origin of the genetic code.

Before the life form based on DNA genomes and proteins, there may have existed a simpler life form
based primarily on RNA. This earlier era is referred to as the "RNA World", during which the genetic
alphabet of four bases is thought to have evolved [3] [4] [5]. The genetic code evolved in the context
of this genetic alphabet while the twenty amino acids joined protein sequences, from the earliest to
the latest [6] [7] [8] [9]. However, in the absence of evidence, these questions remain in
a twilight zone of speculation and controversy. Since no direct data on the early evolution of amino
acid frequencies and genomic GC content survive today, we have to choose a proper
evolutionary order for the extant organisms to discern the regularity of their early evolution.

There are different theories about the origin of the genetic code [10]. The frozen-accident theory
proposed by Crick states that any change to the contemporary genetic code would be lethal unless
many simultaneous mutations altered the code [11]. In its extreme form, the theory implies that the
allocation of codons to amino acids was entirely a matter of chance. The stereochemical theory, in
contrast, says that the code is universal because each amino acid fits its own anticodon or codon in
some way. In its extreme form, the stereochemical theory is said to liken the genetic code to a
"periodic table" in which the "polarity and bulkiness of amino acid side chains can be used to predict
the anticodon with considerable confidence." [12] As far as the evolution of amino acid frequencies is
concerned, some believe that the pattern emerged before the last universal common ancestor of all
extant organisms [13], while others believe that it conforms with the standard, nearly neutral
theoretical expectations [13]. In practice, it is routinely assumed that amino acid frequencies are
constant [15] [16]. The difference in genomic GC content among species is usually explained by a
biased AT/GC pressure exerted on the entire genome during evolution [17] [18], but the nature of this
primary biased AT/GC pressure was unknown. The mechanism of the correlation between the GC content of
total genomic DNA and that of the first, second, and third codon positions was also unknown. Finally,
as far as the protein length distribution is concerned, its cause has not been reported so far.

Considering the analogy between biology and linguistics at the level of sequence, we propose a
linguistic model to explain all these phenomena. Storage and expression of information are crucial
in both biology and linguistics. Like grammar in human language, which can be viewed as a
transformation of cell language [15], linguistic rules are required to record information in protein
or DNA sequences. Many attempts have been made to apply linguistic theory to biology, for predicting
the structures of biological molecules (RNA, DNA and proteins) or for understanding protein assembly
and function [20] [21] [22] [23]. Here we focus on protein linguistics at the early time when the
genetic code evolved. The closeness of the results of our linguistic model to the experimental
observations shows that the evolution of the genetic code determines the evolution of amino acid
frequencies and genomic GC content. Linguistics thus played a significant role in delivering genetic
information from the RNA world to the DNA-protein world, and the linguistic mechanism is important for
revealing the formation of the genetic code, predicting protein structures [20] and explaining the
diversification of life.

Results 



Evolution of amino acid frequencies. We analyzed the amino acid frequencies of 106 species
in the Predictions for Entire Proteomes (PEP) database. Although the amino acid frequencies vary
only slightly, the pattern of their variation among extant organisms is highly regular.
Following the consensus chronology in which amino acids were recruited into the genetic code,
from the earliest to the latest [24]: G, A, D, V, P, S, E, L, T, R, Q, I, N, H, K, C, F, Y, M, W, we
sort the 106 species by the ratio of the average frequency of the 10 later amino acids to the average
frequency of the 10 earlier amino acids ("the ratio of amino acid frequencies" for short in the
following). Thus,
Figure 1: Evolution of amino acid frequencies. (a) Evolution of amino acid frequencies based on
the data of 106 species, where species are sorted from left to right for each amino acid by the ratio of
average frequency for 10 later amino acids to average frequency for 10 earlier ones. Amino acids are
aligned chronologically. (b) Simulation of the evolution of amino acid frequencies by the linguistic
model, where the variable t increases from left to right for each amino acid. (c) Evolution of amino
acid frequencies of 106 species sorted by average protein length.

there are 106 data points aligned from left to right in the above order for each of the 20 amino acids.
We then obtain the evolutionary trends of amino acid frequencies: the frequencies of G, A, D, V, P, T,
R, H, W decrease, the frequencies of S, I, N, K, F, Y increase, and the frequencies of E, L, Q, C,
M do not vary noticeably (Fig. 1a). These variations are, by and large, remarkably monotonic. The
magnitudes of variation differ: the frequencies of G, A, V, P, R decrease more rapidly than those of
D, T, H, W, while the frequencies of I, N, K, F, Y increase more rapidly than that of S. Most of the
amino acids whose frequencies decrease (increase) are among the 10 earlier (later) amino acids, but
there are exceptions, namely H, W and S. The choice of order in the above procedure may influence the
evolutionary trends, but the exceptions, as well as the magnitudes and monotonicity of the variation,
cannot be explained by this trivial reason.
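
The sorting key described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the chronology string is taken from the text, while the function names (`frequency_ratio`, `sort_species`) and the two toy frequency tables are hypothetical and do not come from PEP.

```python
# Consensus chronology quoted in the text, earliest to latest.
CHRONOLOGY = "GADVPSELTRQINHKCFYMW"
EARLIER, LATER = CHRONOLOGY[:10], CHRONOLOGY[10:]


def frequency_ratio(freqs, later=LATER, earlier=EARLIER):
    """Mean frequency of the later amino acids divided by that of the earlier ones."""
    mean_later = sum(freqs[a] for a in later) / len(later)
    mean_earlier = sum(freqs[a] for a in earlier) / len(earlier)
    return mean_later / mean_earlier


def sort_species(proteome_freqs):
    """Return species names ordered by increasing ratio of amino acid frequencies."""
    return sorted(proteome_freqs, key=lambda name: frequency_ratio(proteome_freqs[name]))


# Made-up frequencies for two hypothetical species (illustration only).
uniform = {a: 0.05 for a in CHRONOLOGY}
late_rich = {a: (0.04 if a in EARLIER else 0.06) for a in CHRONOLOGY}
order = sort_species({"late_rich": late_rich, "uniform": uniform})
```

Passing other `later` and `earlier` sets gives the alternative orders mentioned in the text, for example a ratio built from only a few amino acids.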

We also obtain the evolutionary trends for each of the three domains (eubacteria, archaebacteria,
and eukaryotes); the trends for eukaryotes and archaebacteria are determined only roughly,
since there are only 7 eukaryotes and 12 archaebacteria in PEP. The evolutionary trends of the
three domains are the same for each amino acid, but the initial amino acid frequencies of the three
domains differ. Take G for example: the initial frequency for eubacteria is the greatest while that
for eukaryotes is the least. This fact is related to the phylogenetic tree of the three domains, in
which the deepest branching separated bacteria from the line leading to archaebacteria and eukaryotes
about 3.5 billion years ago, and the divergence of archaebacteria and eukaryotes occurred about 2.3
billion years ago [25]. The frequency of Q for archaebacteria is clearly lower than that for the
other two domains.

The results are the same when the species are ordered by the ratio of the average frequency of several
later amino acids to the average frequency of several earlier ones. When the species are sorted by
average protein length in the proteome (an order independent of the choice of amino acids), the
evolutionary trends are still the same for most of the amino acids: only the trend for P changes
(Fig. 1c). (Note that a few evolutionary trends are not obvious in this case because the monotonicity
is worse than when sorting by the ratio of amino acid frequencies.) The reason is
that there is a monotonic relationship (discussed below) between average protein
length and the ratio of amino acid frequencies, although the deviations of the species from the
midstream of the evolutionary flow are considerable (Fig. 4).

We have thus observed a regular variation of amino acid frequencies. Both the order
by ratio of amino acid frequencies and the order by average protein length reflect an evolutionary
direction. It is therefore reasonable to assume that a profound mechanism underlies the evolution
of amino acid frequencies.



The linguistic model. The evolution of amino acid frequencies can be explained by our linguistic
model, which combines linguistics and biology. Based on the tree of genetic code
multiplicity and the genetic code chronology, we propose a model that simulates the generation of
protein and DNA sequences by formal linguistics. The model consists of three parts: (i) generate
protein sequences by a tree adjoining grammar [26]; (ii) assign amino acids to the leaves of the
grammar in (i) according to the tree of genetic code multiplicity (see Ref. [27]; the tree can be
obtained by analyzing the symmetries in the genetic code), taking the amino acid chronology [24] into
account; and (iii) translate the protein sequences into DNA sequences according to the genetic code
chronology [28]. (See Materials and methods for details.) The evolution of the genetic code is the
core of the model.

The model has a variable t, which determines the evolution of amino acid frequencies and
accordingly represents evolutionary time. A proteome for a species is defined as a large set of
proteins generated by the model at fixed t, so t also identifies the species in the model. The amino
acid frequencies and the average protein length for a species can then be calculated. The evolutionary
trends of the amino acid frequencies are determined by generating proteomes at different times t in
the model. We can also simulate the evolution of genomic GC content after translating the protein
sequences into DNA sequences. We do not distinguish the three domains in this model.
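
As a minimal sketch of the two per-species quantities used throughout, the following Python function computes amino acid frequencies and average protein length from a list of protein sequences. The function name `proteome_statistics` and the toy proteome are assumptions made for illustration; the sequence generator itself is not reproduced here.

```python
from collections import Counter


def proteome_statistics(proteins):
    """Return (amino acid frequencies, average protein length) for a list of sequences."""
    if not proteins:
        raise ValueError("proteome must contain at least one protein")
    counts = Counter()
    for seq in proteins:
        counts.update(seq)  # count every residue of every protein
    total = sum(counts.values())
    freqs = {aa: n / total for aa, n in counts.items()}
    avg_len = total / len(proteins)
    return freqs, avg_len


# Toy proteome of two hypothetical proteins (not model output).
freqs, avg_len = proteome_statistics(["GAVG", "GK"])
```

Applying the same function to proteomes generated at different values of t would give one frequency table and one average length per simulated species.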



Simulation of evolution of amino acid frequencies. The simulation of our linguistic model (Fig.
1b) agrees with the global analysis of the data for 106 species (Fig. 1a), not only in the evolutionary
trends but also in the magnitudes of variation for most of the amino acids. The frequencies of G, A,
V, R decrease rapidly while the frequencies of D, H, W decrease slowly; the frequencies of I, K, F, Y
increase rapidly while the frequency of S increases slowly. The frequencies of E, L, Q, C, M do not
vary noticeably. In particular, the simulations for the above exceptions (H, W, S) are good. The
simulations for P, T, N are not good enough. Such agreement between theory and biological data could
not be achieved easily unless the right mechanism had been found.

Our model depends faithfully on the genetic code multiplicity in Ref. [27] and the amino acid
chronology in Ref. [24], and no parameter is added to the model on purpose to alter the trend for
any particular amino acid. Therefore, it is the evolution of the genetic code (the core of the model)
that determines the evolution of amino acid frequencies. The position of an amino acid in the tree of
genetic code multiplicity is crucial to its probability of joining the protein sequence as t increases,
which causes the different evolutionary trends of amino acid frequencies. For example, the positions
of G and A, and those of F and Y, are equivalent on the tree of genetic code multiplicity, so the
evolutionary trends for G and A, and for F and Y, are the same.

The 20 amino acid frequencies of the extant organisms are input as constant parameters.
They represent the amounts of amino acids in the primordial soup at the early time when life first
appeared. In the simulation, they are just the initial frequencies, when t starts, at the leftmost
position for each amino acid in Fig. 1b. The evolutionary trends in the simulation are not sensitive
to the values of these 20 parameters: when they are adjusted to a certain extent, the trends do not
change. However, when the genetic code multiplicity is modified even slightly, the evolutionary trends
change greatly and contradict the trends based on the data of 106 species. So the genetic code is the
key to the linguistic mechanism.

The magnitudes of variation in the simulation are, on the whole, slightly smaller than the
magnitudes based on the data of 106 species (Fig. 1). So the evolution of amino acid frequencies should
have continued after the completion of the genetic code, contributing a small part of the whole
variation.
Figure 2: (a) Relationship between genomic GC content and the ratio of average frequency for 10 later
amino acids to average frequency for 10 earlier ones for the species in database PEP and its simulation
by the linguistic model (red solid line). (b) Simulation of the correlation of the GC content between
total genomic DNA and the first, second, and third codon positions.



Co-evolution of amino acid frequencies and GC content and codon position GC content. 

The relationship between genomic GC content and the ratio of amino acid frequencies is regular for
the species in the PEP database: genomic GC content decreases linearly with the ratio of amino acid
frequencies (Fig. 2a). The simulation of our model agrees qualitatively with this relationship (Fig.
2a). In our model, the evolution of amino acid frequencies and the evolution of genomic GC content
are driven by a common variable t: as t increases, more later amino acids are recruited into
the proteins and the genomic GC content decreases. So the amino acid frequencies and GC content
co-evolved. A protein sequence generated at a later time t corresponds to a DNA sequence translated
using the later codons, which results in the relationship between genomic GC content and the ratio of
amino acid frequencies. The simulated line does not coincide with the best-fit line through the points
of the extant organisms because the magnitude of the simulated evolution of amino acid frequencies is
smaller than that of the evolution based on the data of the extant organisms.
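
A linear trend of this kind can be quantified with an ordinary least-squares fit. The sketch below is not the authors' procedure: the function name `linear_fit` is hypothetical and the four data points are invented purely to show the calculation.

```python
def linear_fit(x, y):
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented (ratio, GC content) pairs, not values from PEP.
ratios = [0.5, 0.6, 0.7, 0.8]
gc_contents = [0.70, 0.60, 0.50, 0.40]
slope, intercept = linear_fit(ratios, gc_contents)
```

A negative slope corresponds to GC content decreasing with the ratio of amino acid frequencies; fitting the simulated and observed points separately would make the offset between the two lines explicit.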

The simulation of the correlation between the GC content of total genomic DNA and that of the first,
second, and third codon positions (Fig. 2b) also agrees with the results based on the data of organisms
in Refs. [17] [29], where the correlation slope for the third codon position is much greater than the
slopes for the first and second positions, and the slope for the first position
is slightly greater than that for the second position. In the table of codon chronology
[28], G and C (A and U) occupy all the third positions of the earliest (latest) codons for the 20 amino
acids, while the bases appear about equally at the first and second positions. Therefore, the
correlation slopes for the first and second positions are small while the slope for the third position
is large. Hence the lower limit of the GC content (about 1/4) has to equal one minus its upper
limit (about 3/4).
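
For readers who want to reproduce codon-position GC content from a coding sequence, a minimal Python sketch follows. It assumes an in-frame DNA sequence starting at the first codon position; the function name `codon_position_gc` and the example sequence are hypothetical.

```python
def codon_position_gc(cds):
    """Return the GC fraction at the first, second and third codon positions."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing incomplete codon
    if usable == 0:
        raise ValueError("sequence must contain at least one complete codon")
    gc_counts = [0, 0, 0]
    for i in range(usable):
        if cds[i] in "GC":
            gc_counts[i % 3] += 1  # i % 3 is the position within the codon
    n_codons = usable // 3
    return tuple(count / n_codons for count in gc_counts)


# Four codons: ATG GCC AAA TGA (toy sequence, not real data).
gc1, gc2, gc3 = codon_position_gc("ATGGCCAAATGA")
```

Plotting each of the three values against total GC content for many genomes gives the kind of correlation lines discussed in the text.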

The simulated results also agree in detail with the results based on the data of organisms.
There is a fine structure, i.e., an "upward step", in the middle of the line of the simulated GC
content for the first codon position and a small "downward step" in the line of the simulated GC
content for the third codon position (Fig. 2b).
Figure 3: Protein length spectrum analysis. (a) PLSCC matrix for 106 species. (b) Average PLSCC.

The predicted fine structure agrees closely with the results based on 124 complete eubacterial
genomes and 19 complete archaebacterial genomes [29]. A convex feature appears in the line of GC
content for the first codon position and a small concave feature appears in the line of GC content for
the third codon position (see Fig. 5 in Ref. [29]). The agreement is more obvious for the results based
on the 19 complete archaebacterial genomes. The lower limits of the GC content for the first and second
codon positions are equal in the simulation, which agrees with the results based on the data of
organisms. These features also agree with Fig. 2 of Ref. [17], based on the data of 11 species.



Protein length spectrology. The linguistic mechanism is also supported by the features of the
protein length spectrum, i.e. the distribution of protein length. The outline of the spectrum is
bell-shaped on the whole, but strong fluctuations can be observed all over the spectrum for each
species. When the grammar rules are changed in the simulation, the spectrum also changes. Therefore
the fluctuation of the spectrum is not stochastic but is related to the protein linguistic rules. Some
structures, e.g. periodic-like fluctuations [50], can be observed in the spectra. We do not regard
them as periodicity; rather, they too are related to the underlying grammar rules. A dip appears in
the protein length spectrum of each domain at a length near 200 aa [15], which might be related to DNA
circularization. The main features of the spectrum (including the bell-shaped outline, intrinsic
fluctuations and periodic-like structures) can be simulated by the linguistic model.

In addition, the relationship between species can be inferred from their protein length spectra. A
more closely related pair of species has a higher protein length spectrum correlation coefficient
(PLSCC) (defined in Materials and methods). The image of the PLSCC matrix is a global
landscape illustrating the relationships among all species (Fig. 3a), where the "folded mountains"
reflect the branching in the phylogenetic tree. For example, numbers 71, 89, 94, 95, 97 represent
mycoplasmas, which are situated at the far end of the phylogenetic tree, so the PLSCCs for these
species are small (forming a "valley" in the landscape).



The evolutionary flow. We also find a relationship between the average protein length and
the ratio of the average frequency of several later amino acids to the average frequency of several
earlier ones. The species of the three domains gather in different regions of the space of average
protein length and ratio of amino acid frequencies (Fig. 4), which supports the three-domain
classification [31] [32]. The distance between closely related species in this space is small; for
instance, the points for human and mouse are close together.

The distribution of all species in the space is, on the whole, a bowed line (Fig. 4), which is indeed
an evolutionary flow for the following reasons. The species with large (small) genomes or with large
(small) average PLSCC are located in the midstream (margin) of the flow (Fig. 4), and most of the
PLSCC directions (defined in Materials and methods) are, by and large, parallel to the direction of
the flow (Fig. 4). The evolutionary direction of the flow is from simple organisms
toward complex organisms. The bowed evolutionary flow can be simulated by our model (Fig.
4, Embedded), where the time t gives the right evolutionary direction of the flow. That the simulated
range of the ratio of amino acid frequencies is smaller than the range for the 106 species is also due
to the smaller simulated magnitude of the evolution of amino acid frequencies. The bending direction in
the simulation, which is sensitive to the exact form of the genetic code multiplicity, agrees with the
flow based on the data of 106 species. Such agreement between theory and the data of 106 species
confirms the linguistic mechanism of protein evolution.

In terms of the linguistic model, the variation of amino acid frequencies and genomic GC content
developed mainly during the time when the genetic code evolved. At that time, the three domains had not
separated and most of the extant organisms had not appeared, so the evolution of amino acid frequencies
is the same for the three domains. Thus, the pattern of the variation of amino acid frequencies among
species must have formed before the last universal common ancestor of all extant organisms. Thereafter,
in modern evolution, the amino acid frequencies should continue to evolve, driven by the
established pattern of differences in amino acid frequencies. Such a driving force is intrinsic to
the evolution of genes. The evolution of life is a process of keeping a balance between the
restrictions imposed by the intrinsic rules of the genes and the need for species to adapt to changes
in their surroundings. Although the genes are stored in separate individuals, they are tightly
connected by the food chain and the sexual network. So the intrinsic rules of the genes are reflected
by the evolutionary flow, and the general pattern of the variation of amino acid frequencies in early
times has been inherited by the extant organisms.

Comparing sets of orthologous proteins encoded by triplets of closely related genomes from 15
taxa, Jordan et al. found that C(0.45), M(0.22), H(0.18), S(0.13), F(0.12) accrue in at least 14 taxa,
whereas P(-0.36), A(-0.16), E(-0.15), Q(-0.10) are consistently lost, for which they did not give a
reason (the values in parentheses reflect the magnitudes of gain (plus) and loss (minus) of amino
acids) [13]. Their result for the modern evolution of amino acid frequencies agrees in principle with
our result for the primordial evolution of amino acid frequencies. We believe that the amino acid gain
and loss is driven by the "pressure" of the established pattern of the variation of amino acid
frequencies, whose direction is parallel to the direction of the evolutionary flow. The frequencies of
P and A decrease rapidly, while the frequencies of S and F increase. The magnitudes of gain for
C and M are considerable, but their frequencies do not vary noticeably in the primordial
evolution. This disagreement might be explained by the suggestion that they are late-coming amino
acids [33] [34].

Discussion 

"The starting point of a biological investigation will be theoretical." [35] Our work is a successful
example of studying biology on the basis of general principles. It must be emphasized that the above
explanations are systematic: we have explained those phenomena in a unified theoretical framework.
Why are there so many linear or quasi-linear relationships, such as those manifested in Fig. 1a, Fig.
2a, Fig. 2b and Fig. 4? The reason is shown by the linguistic model: there is only one adjustable
variable, i.e. the time t, in the model, which drives the evolution of all quantities. Our work shows
that the origin and evolution of life is determined by a definite mechanism. Otherwise we cannot
imagine that
Figure 4: The evolutionary flow. Relationship between average protein length and ratio of frequencies
(here choosing H, Q, W among later amino acids and G, V among earlier ones) for 106 species
(archaebacteria: blue square, eubacteria: star, virus: black square, eukaryote: red square). The
proteome size is represented by the colour of the tail below each species point (big: red, small:
blue). The green (caesious) stars denote species with big (small) average PLSCC. The black arrows
denote PLSCC directions. (Embedded) Simulation of the bowed evolutionary flow.



all the dots in the above figures align so obediently with lines or gentle curves. If there exists a
mechanism that governs the origin and evolution of life, life is indeed a consequence of the evolution
of matter and is destined to appear in the universe. If the linguistic mechanism is universal, life
must be universal in the deep sky.

An important property of the model is that the amino acid frequency parameters are constant,
which indicates that the abundance of amino acids in the surroundings had no time to change while
the differences in amino acid frequencies among species were forming. Therefore the variation of amino
acid frequencies should have developed during a short time in the transition from the RNA world to the
DNA world. We conjecture that there were many life groups governed by different possible linguistic
systems on this planet at early times, but one system prevailed over the others spontaneously, as in
symmetry breaking. Thus the selection of the genetic code was a matter of chance. All the amino acids
in the surroundings were used up to generate proteins according to the contemporary genetic code, while
the other linguistic systems died out.

Crick's frozen-accident theory states: "...To account for it being the same in all organisms
one must assume that all life evolved from a single organism (more strictly, from a single closely
interbreeding population)." [11] Our theory differs from Crick's. We suggest that a linguistic
system, rather than a species, was selected by chance to produce the universal genetic code. There
is an analogy between our theory and the evolution of natural language: which language among
the "ecology" of natural languages became the international language was a matter of historical
chance. Our theory can explain the exceptions to the universal genetic code that are used
in mitochondria and in the principal genomes of certain species: they might be relics of other
linguistic systems. This phenomenon is familiar in natural languages. Many ancient languages have
died out, but some ancient grammar rules may survive in contemporary languages.

We suggest that there is an intrinsic relationship between the laws of life and the laws of matter,
which determines our proper scale compared with the scale of the earth. Information plays a
central role in biology as well as in physics [36] [37]. There is an upper limit on the information
that can be stored in a given space [35]. Here we present a heuristic thought experiment. Let a
biologist and a physicist drink together in a room in which enough hard disks are stored that the
amount of information in the room is just below the critical value of the upper limit. As inspiring
new ideas come to the two scientists continuously, the amount of information must be able to exceed
the critical value, while the mass in the room is conserved. The volume of the room must then inflate,
driven by the increasing information. This suggests that information is, to some extent, equivalent to
mass. Therefore, life fights for a decent space to live in this universe. If a universe
is too small, the information created by all the life in it must drive it to expand by some
unknown mechanism. Maybe the accelerating expansion of our universe is propelled partly by the
activities of life. The genetic code bridges the world of matter and the world of life. A general
theory is expected to break down the wall between physics and biology so that we can consider the
phenomena of life and matter in a unified theory.



Materials and methods 



Data collection. The amino acid frequencies and average protein lengths for 106 species (85
eubacteria, 12 archaebacteria, 7 eukaryotes and 2 viruses) were obtained from the data in PEP at
URL: http://cubic.bioc.columbia.edu/pep. The GC contents were obtained from the Genome Properties
system [39]. These species are taken as representative of all species on earth for studying the
evolution of amino acid frequencies and genomic GC content.



Linguistic rules and parameters. For the three parts of the model: (i) There are one initial tree and
two auxiliary trees in the tree adjoining grammar. Each leaf of a tree in the grammar stands for an
amino acid, which is determined by (ii). The variable t (the only adjustable parameter) represents the
probability of replacement in the adjoining operation [21]. (ii) D, P, F, Q, M, V, S and the
termination code are at inner-node positions on the tree based on genetic code multiplicity, while G,
A, E, L, T, R, I, N, H, K, C, Y, W are at leaf positions on the tree (Fig. 5). The 20 present-day amino
acid frequencies are input as constant parameters of the model; they represent the amounts of amino
acids in the surroundings and determine the probabilities (also constant) of selecting amino acids at
each fork of the tree. The amino acid that joins the protein sequence in (i) is selected randomly along
the branches of the tree from the root to a leaf in the first loop of the program, and is selected
randomly from the root
Figure 5: Grammar tree of genetic code multiplicity. The 20 amino acids are aligned chronologically
from left to right. This picture is faithfully based on the genetic code multiplicity in Ref. [27] and
the amino acid chronology in Ref. [24]. In the program for the model, the amino acids that join the
protein sequence are selected strictly according to this tree.

or a randomly chosen inner node to a leaf in the following loops. The probability of starting from an
inner node other than the root in the following loops increases linearly with t, which results in the
evolution of amino acid frequencies in the simulation. (iii) The degenerate codons are used
chronologically, from the earliest to the latest [28], for each amino acid as t increases.



Simulation of sequence generation. Starting from the initial tree, a protein sequence can be
generated by adjoining operations. There are two steps in the generation of a protein in our model.
Mini-proteins with lengths of about 10 aa are generated in the first step; the whole protein is then
obtained by connecting mini-proteins in the second step (also based on the tree adjoining grammar),
in which the protein length increases proportionally. The amino acid frequencies do not change in the
second step. The simulations in Fig. 1, Fig. 2 and Fig. 4 are obtained by calculating 50,000 generated
mini-proteins (to avoid stochastic error) for 30 species, as t increases from 0.02 to 0.40 in equal
steps. Greater values of t are avoided because they result in excessively long proteins in the program.



Protein length spectrum analysis. The protein length spectrum is obtained by counting the number
of proteins of each length in each of the 106 proteomes. The protein length spectra of the three
domains are obtained by summing the protein length spectra of all species in each domain. The simulated
protein length spectrum is obtained by generating a certain number of proteins.

PLSCC is defined as the inner product of a pair of normalized protein length spectra (each
protein length spectrum being treated as a vector); it is therefore just the cosine of the angle
between the two spectrum vectors for the pair of species. Average PLSCC is obtained by averaging the
PLSCCs between a species and all the other species (Fig. 3b). The 106 x 106 PLSCC matrix is obtained by
calculating PLSCC for each pair among the 106 complete proteomes (the species are sorted by average
protein length). PLSCC direction is defined as the direction along which the average PLSCCs of a group
of closely related species (represented by different shapes overlaid on the stars in Fig. 4) decrease
on the whole, which reflects the direction of evolution.
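
The spectrum and PLSCC definitions above can be sketched in Python as follows. The sketch assumes that "normalized" means division by the Euclidean norm, which is what makes the inner product a cosine; the function names and the three toy proteomes are hypothetical.

```python
import math
from collections import Counter


def length_spectrum(lengths, max_len):
    """Number of proteins of each length 1..max_len (the protein length spectrum)."""
    counts = Counter(lengths)
    return [counts.get(n, 0) for n in range(1, max_len + 1)]


def plscc(spec_a, spec_b):
    """Cosine of the angle between two spectra (assumes Euclidean normalization)."""
    dot = sum(a * b for a, b in zip(spec_a, spec_b))
    norm_a = math.sqrt(sum(a * a for a in spec_a))
    norm_b = math.sqrt(sum(b * b for b in spec_b))
    return dot / (norm_a * norm_b)


def average_plscc(spectra):
    """Mean PLSCC between each species and all the other species."""
    names = list(spectra)
    return {
        a: sum(plscc(spectra[a], spectra[b]) for b in names if b != a) / (len(names) - 1)
        for a in names
    }


# Toy protein lengths for three hypothetical species (illustration only).
spectra = {
    "x": length_spectrum([1, 2, 2], max_len=3),
    "y": length_spectrum([1, 2, 2], max_len=3),
    "z": length_spectrum([3, 3], max_len=3),
}
averages = average_plscc(spectra)
```

Evaluating `plscc` for every pair of species yields the symmetric PLSCC matrix, whose diagonal entries are 1.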



We thank Shengli Zhang, Zhenwei Yao, Hefeng Wang, Qinye Yin for discussions and Zhiwei Li and
Philip Carter for assistance.